{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "DataAnalysis"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "import pylab\n",
    "from nltk import word_tokenize\n",
    "from nltk.corpus import brown\n",
    "from tools import show_subtitle\n",
    "%matplotlib inline"
   ]
  },
  {
   "source": [
    "# Ch5 分类和标注词汇\n",
    "\n",
    "1.  什么是词汇分类，在自然语言处理中它们如何使用？\n",
    "2.  对于存储词汇和它们的分类来说什么是好的 Python 数据结构？\n",
    "3.  如何自动标注文本中每个词汇的词类？\n",
    "\n",
    "-   词性标注（parts-of-speech tagging，POS tagging）：简称标注。将词汇按照它们的词性（parts-of-speech，POS）进行分类并对它们进行标注\n",
    "-   词性：也称为词类或者词汇范畴。\n",
    "-   标记集：用于特定任务标记的集合。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "brown_tagged_sents = brown.tagged_sents(categories='news')\n",
    "brown_sents = brown.sents(categories='news')\n",
    "brown_tagged_words = brown.tagged_words(categories='news')\n",
    "brown_words = brown.words(categories='news')"
   ]
  },
  {
   "source": [
    "## 5.5 N元语法标注器\n",
    "xxxTagger() 只能使用 sent 作为训练语料"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 5.5.1 一元标注器，统计词料库中每个单词标注最多的词性作为一元语法模型的建立基础"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >unigram_tagger.tag(brown_sents[2007])< ---------------\n[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]\n"
     ]
    },
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0.9349006503968017"
      ]
     },
     "metadata": {},
     "execution_count": 3
    }
   ],
   "source": [
    "# 使用训练数据来评估一元标注器的准确度\n",
    "unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)\n",
    "show_subtitle(\"unigram_tagger.tag(brown_sents[2007])\")\n",
    "print(unigram_tagger.tag(brown_sents[2007]))\n",
    "unigram_tagger.evaluate(brown_tagged_sents)"
   ]
  },
  {
   "source": [
    "### 5.5.2 将数据分为 训练集 和 测试集"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0.8121200039868434"
      ]
     },
     "metadata": {},
     "execution_count": 4
    }
   ],
   "source": [
    "# 使用训练数据来训练一元标注器，使用测试数据来评估一元标注器的准确度\n",
    "size = int(len(brown_tagged_sents) * 0.9)\n",
    "train_sents = brown_tagged_sents[:size]\n",
    "test_sents = brown_tagged_sents[size:]\n",
    "unigram_tagger = nltk.UnigramTagger(train_sents)\n",
    "unigram_tagger.evaluate(test_sents)"
   ]
  },
  {
   "source": [
    "### 5.5.3 更加一般的N元标注器"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >bigram_tagger.tag(train_sents[2007])< ---------------\n[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]\n--------------- >bigram_tagger.tag(brown_sents[4203])< ---------------\n[('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'), ('Congo', 'NP'), ('is', 'BEZ'), ('13.5', None), ('million', None), (',', None), ('divided', None), ('into', None), ('at', None), ('least', None), ('seven', None), ('major', None), ('``', None), ('culture', None), ('clusters', None), (\"''\", None), ('and', None), ('innumerable', None), ('tribes', None), ('speaking', None), ('400', None), ('separate', None), ('dialects', None), ('.', None)]\n"
     ]
    },
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0.10206319146815508"
      ]
     },
     "metadata": {},
     "execution_count": 5
    }
   ],
   "source": [
    "# 二元标注器\n",
    "bigram_tagger = nltk.BigramTagger(train_sents)\n",
    "\n",
    "# 标注训练集中数据\n",
    "show_subtitle(\"bigram_tagger.tag(train_sents[2007])\")\n",
    "print(bigram_tagger.tag(brown_sents[2007]))\n",
    "\n",
    "# 标注测试集中数据\n",
    "show_subtitle(\"bigram_tagger.tag(brown_sents[4203])\")\n",
    "print(bigram_tagger.tag(brown_sents[4203]))\n",
    "\n",
    "bigram_tagger.evaluate(test_sents)  # 整体准确度很低，是因为数据稀疏问题"
   ]
  },
  {
   "source": [
    "### 5.5.4 组合标注器，效果更差，为什么？"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0.8361407355726104"
      ]
     },
     "metadata": {},
     "execution_count": 6
    }
   ],
   "source": [
    "t0 = nltk.DefaultTagger('NN')\n",
    "t1 = nltk.UnigramTagger(train_sents, backoff=t0)\n",
    "t1.evaluate(test_sents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0.8452108043456593"
      ]
     },
     "metadata": {},
     "execution_count": 7
    }
   ],
   "source": [
    "t2 = nltk.BigramTagger(train_sents, backoff=t1)\n",
    "t2.evaluate(test_sents)  # 这个效果最好"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0.843317053722715"
      ]
     },
     "metadata": {},
     "execution_count": 8
    }
   ],
   "source": [
    "t3 = nltk.TrigramTagger(train_sents, backoff=t2)\n",
    "t3.evaluate(test_sents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0.8426193561247882"
      ]
     },
     "metadata": {},
     "execution_count": 9
    }
   ],
   "source": [
    "t2 = nltk.BigramTagger(train_sents, cutoff=1, backoff=t1)\n",
    "t2.evaluate(test_sents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0.8429183693810426"
      ]
     },
     "metadata": {},
     "execution_count": 10
    }
   ],
   "source": [
    "# cutoff=15时，准确率高，可见上下文并不能真正提示单词标注的内在规律\n",
    "t3 = nltk.TrigramTagger(train_sents, cutoff=15, backoff=t2)\n",
    "t3.evaluate(test_sents)"
   ]
  },
  {
   "source": [
    "### 5.5.5 标注未知的单词\n",
    "\n",
    "对于生词。可以使用回退到正则表达式标注器或者默认标注器，但是都无法利用上下文。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 5.5.6 标注器的存储"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "t2.tag(tokens)=  [('The', 'AT'), (\"board's\", 'NN$'), ('action', 'NN'), ('shows', 'NNS'), ('what', 'WDT'), ('free', 'JJ'), ('enterprise', 'NN'), ('is', 'BEZ'), ('up', 'RP'), ('against', 'IN'), ('in', 'IN'), ('our', 'PP$'), ('complex', 'JJ'), ('maze', 'NN'), ('of', 'IN'), ('regulatory', 'NN'), ('laws', 'NNS'), ('.', '.')]\n",
      "t2.evaluate(test_sents)=  0.8426193561247882\n"
     ]
    }
   ],
   "source": [
    "from pickle import dump, load\n",
    "\n",
    "text = \"\"\"The board's action shows what free enterprise\n",
    "    is up against in our complex maze of regulatory laws .\"\"\"\n",
    "tokens = text.split()\n",
    "\n",
    "output = open('t2.pkl', 'wb')\n",
    "dump(t2, output, -1)\n",
    "output.close()\n",
    "print(\"t2.tag(tokens)= \", t2.tag(tokens))\n",
    "print(\"t2.evaluate(test_sents)= \", t2.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "t2_bak.tag(tokens)=  [('The', 'AT'), (\"board's\", 'NN$'), ('action', 'NN'), ('shows', 'NNS'), ('what', 'WDT'), ('free', 'JJ'), ('enterprise', 'NN'), ('is', 'BEZ'), ('up', 'RP'), ('against', 'IN'), ('in', 'IN'), ('our', 'PP$'), ('complex', 'JJ'), ('maze', 'NN'), ('of', 'IN'), ('regulatory', 'NN'), ('laws', 'NNS'), ('.', '.')]\n",
      "t2_bak.evaluate(test_sents)=  0.8426193561247882\n"
     ]
    }
   ],
   "source": [
    "input = open('t2.pkl', 'rb')\n",
    "t2_bak = load(input)\n",
    "input.close()\n",
    "\n",
    "print(\"t2_bak.tag(tokens)= \", t2_bak.tag(tokens))\n",
    "print(\"t2_bak.evaluate(test_sents)= \", t2_bak.evaluate(test_sents))"
   ]
  },
  {
   "source": [
    "### 5.5.7 N元标注器的性能边界（上限）"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0.049297702068029296"
      ]
     },
     "metadata": {},
     "execution_count": 13
    }
   ],
   "source": [
    "# 一种方法是寻找有歧义的单词的数目，大约有1/20的单词可能有歧义\n",
    "# cfd无法正确赋值，因为有些句子的长度少于3个单词，影响了trigrams()函数的正确运行\n",
    "cfd = nltk.ConditionalFreqDist(\n",
    "        ((x[1], y[1], z[0]), z[1])\n",
    "        for sent in brown_tagged_sents \n",
    "        if len(sent) >= 3\n",
    "        for x, y, z in nltk.trigrams(sent))\n",
    "ambiguous_contexts = [\n",
    "        c\n",
    "        for c in cfd.conditions() \n",
    "        if len(cfd[c]) > 1\n",
    "]\n",
    "sum(\n",
    "        cfd[c].N()\n",
    "        for c in ambiguous_contexts\n",
    ") / cfd.N()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >0< ---------------\nlen(sent)=  25\ntag(sent)=  [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), (\"Atlanta's\", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), (\"''\", \"''\"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]\nThe Fulton County AT NP-TL NN-TL\nFulton County Grand NP-TL NN-TL JJ-TL\nCounty Grand Jury NN-TL JJ-TL NN-TL\nGrand Jury said JJ-TL NN-TL VBD\nJury said Friday NN-TL VBD NR\nsaid Friday an VBD NR AT\nFriday an investigation NR AT NN\nan investigation of AT NN IN\ninvestigation of Atlanta's NN IN NP$\nof Atlanta's recent IN NP$ JJ\nAtlanta's recent primary NP$ JJ NN\nrecent primary election JJ NN NN\nprimary election produced NN NN VBD\nelection produced `` NN VBD ``\nproduced `` no VBD `` AT\n`` no evidence `` AT NN\nno evidence '' AT NN ''\nevidence '' that NN '' CS\n'' that any '' CS DTI\nthat any irregularities CS DTI NNS\nany irregularities took DTI NNS VBD\nirregularities took place NNS VBD NN\ntook place . VBD NN .\n--------------- >1< ---------------\nlen(sent)=  43\ntag(sent)=  [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlanta', 'NP-TL'), (\"''\", \"''\"), ('for', 'IN'), ('the', 'AT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'AT'), ('election', 'NN'), ('was', 'BEDZ'), ('conducted', 'VBN'), ('.', '.')]\nThe jury further AT NN RBR\njury further said NN RBR VBD\nfurther said in RBR VBD IN\nsaid in term-end VBD IN NN\nin term-end presentments IN NN NNS\nterm-end presentments that NN NNS CS\npresentments that the NNS CS AT\nthat the City CS AT NN-TL\nthe City Executive AT NN-TL JJ-TL\nCity Executive Committee NN-TL JJ-TL NN-TL\nExecutive Committee , JJ-TL NN-TL ,\nCommittee , which NN-TL , WDT\n, which had , WDT HVD\nwhich had over-all WDT HVD JJ\nhad over-all charge HVD JJ NN\nover-all charge of JJ NN IN\ncharge of the NN IN AT\nof the election IN AT NN\nthe election , AT NN ,\nelection , `` NN , ``\n, `` deserves , `` VBZ\n`` deserves the `` VBZ AT\ndeserves the praise VBZ AT NN\nthe praise and AT NN CC\npraise and thanks NN CC NNS\nand thanks of CC NNS IN\nthanks of the NNS IN AT\nof the City IN AT NN-TL\nthe City of AT NN-TL IN-TL\nCity of Atlanta NN-TL IN-TL NP-TL\nof Atlanta '' IN-TL NP-TL ''\nAtlanta '' for NP-TL '' IN\n'' for the '' IN AT\nfor the manner IN AT NN\nthe manner in AT NN IN\nmanner in which NN IN WDT\nin which the IN WDT AT\nwhich the election WDT AT NN\nthe election was AT NN BEDZ\nelection was conducted NN BEDZ VBN\nwas conducted . BEDZ VBN .\n--------------- >2< ---------------\nlen(sent)=  35\ntag(sent)=  [('The', 'AT'), ('September-October', 'NP'), ('term', 'NN'), ('jury', 'NN'), ('had', 'HVD'), ('been', 'BEN'), ('charged', 'VBN'), ('by', 'IN'), ('Fulton', 'NP-TL'), ('Superior', 'JJ-TL'), ('Court', 'NN-TL'), ('Judge', 'NN-TL'), ('Durwood', 'NP'), ('Pye', 'NP'), ('to', 'TO'), ('investigate', 'VB'), ('reports', 'NNS'), ('of', 'IN'), ('possible', 'JJ'), ('``', '``'), ('irregularities', 'NNS'), (\"''\", \"''\"), ('in', 'IN'), ('the', 'AT'), ('hard-fought', 'JJ'), ('primary', 'NN'), ('which', 'WDT'), ('was', 'BEDZ'), ('won', 'VBN'), ('by', 'IN'), ('Mayor-nominate', 'NN-TL'), ('Ivan', 'NP'), ('Allen', 'NP'), ('Jr.', 'NP'), ('.', '.')]\nThe September-October term AT NP NN\nSeptember-October term jury NP NN NN\nterm jury had NN NN HVD\njury had been NN HVD BEN\nhad been charged HVD BEN VBN\nbeen charged by BEN VBN IN\ncharged by Fulton VBN IN NP-TL\nby Fulton Superior IN NP-TL JJ-TL\nFulton Superior Court NP-TL JJ-TL NN-TL\nSuperior Court Judge JJ-TL NN-TL NN-TL\nCourt Judge Durwood NN-TL NN-TL NP\nJudge Durwood Pye NN-TL NP NP\nDurwood Pye to NP NP TO\nPye to investigate NP TO VB\nto investigate reports TO VB NNS\ninvestigate reports of VB NNS IN\nreports of possible NNS IN JJ\nof possible `` IN JJ ``\npossible `` irregularities JJ `` NNS\n`` irregularities '' `` NNS ''\nirregularities '' in NNS '' IN\n'' in the '' IN AT\nin the hard-fought IN AT JJ\nthe hard-fought primary AT JJ NN\nhard-fought primary which JJ NN WDT\nprimary which was NN WDT BEDZ\nwhich was won WDT BEDZ VBN\nwas won by BEDZ VBN IN\nwon by Mayor-nominate VBN IN NN-TL\nby Mayor-nominate Ivan IN NN-TL NP\nMayor-nominate Ivan Allen NN-TL NP NP\nIvan Allen Jr. NP NP NP\nAllen Jr. . NP NP .\n"
     ]
    }
   ],
   "source": [
    "# Colquitt 就是那个错误的句子，在ca01文本文件中可以找到\n",
    "for i, sent in enumerate(brown_tagged_sents[:3]):\n",
    "    show_subtitle(str(i))\n",
    "    print(\"len(sent)= \", len(sent))\n",
    "    print(\"tag(sent)= \", sent)\n",
    "    if len(sent) >= 3:\n",
    "        for x, y, z in nltk.trigrams(sent):\n",
    "            print(x[0], y[0], z[0], x[1], y[1], z[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "<ConfusionMatrix: 51866/61604 correct>"
      ]
     },
     "metadata": {},
     "execution_count": 15
    }
   ],
   "source": [
    "# 一种方法是研究被错误标记的单词\n",
    "# ToDo: 可是显示出来的结果根本没有可视性呀？\n",
    "test_tags = [\n",
    "        tag\n",
    "        for sent in brown.sents(categories='editorial')\n",
    "        for (word, tag) in t2.tag(sent)\n",
    "]\n",
    "gold_tags = [\n",
    "        tag\n",
    "        for (word, tag) in brown.tagged_words(categories='editorial')\n",
    "]\n",
    "nltk.ConfusionMatrix(gold_tags, test_tags)"
   ]
  },
  {
   "source": [
    "跨句子边界的标注\n",
    "\n",
    "使用三元标注器时，跨句子边界的标注会使用上个句子的最后一个词+标点符号+这个句子的头一个词\n",
    "但是，两个句子中的词并没有相关性，因此需要使用已经标注句子的链表来训练、运行和评估标注器\n",
    "\n",
    "Ex5-5 句子层面的N-gram标注\n",
    "\n",
    "前面的组合标注器已经是跨句子边界的标注"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 5.6 基于转换的标注\n",
    "n-gram标注器存在的问题：\n",
    "1.  表的大小（语言模型），对于trigram表会产生巨大的稀疏矩阵\n",
    "2.  上下文。n-gram标注器从上下文中获得的唯一信息是标记，而忽略了词本身。\n",
    "\n",
    "在本节中，利用Brill标注，这是一种归纳标注方法，性能好，使用的模型仅有n-gram标注器的很小一部分。\n",
    "\n",
    "Brill标注是基于转换的学习，即猜想每个词的标记，然后返回和修正错误的标记，陆续完成整个文档的修正。\n",
    "与n-gram标注一样，需要监督整个过程，但是不计数观察结果，只编制一个转换修正规则链表。\n",
    "\n",
    "Brill标注依赖的原则：规则是语言学可解释的。因此Brill标注可以从数据中学习规则，并且也只记录规则。\n",
    "而n-gram只是隐式的记住了规律，并没有将规律抽象出规则，从而记录了巨大的数据表。\n",
    "\n",
    "Brill转换规则的模板：在上下文中，替换T1为T2.\n",
    "-   每一条规则都根据其净收益打分 = 修正不正确标记的数目 - 错误修改正确标记的数目"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "Loading tagged data from treebank... \n",
      "Read testing data (200 sents/5251 wds)\n",
      "Read training data (800 sents/19933 wds)\n",
      "Read baseline data (800 sents/19933 wds) [reused the training set]\n",
      "Trained baseline tagger\n",
      "    Accuracy on test set: 0.8366\n",
      "Training tbl tagger...\n",
      "TBL train (fast) (seqs: 800; tokens: 19933; tpls: 24; min score: 3; min acc: None)\n",
      "Finding initial useful rules...\n",
      "    Found 12799 useful rules.\n",
      "\n",
      "           B      |\n",
      "   S   F   r   O  |        Score = Fixed - Broken\n",
      "   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct\n",
      "   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect\n",
      "   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect\n",
      "   e   d   n   r  |  e\n",
      "------------------+-------------------------------------------------------\n",
      "  23  23   0   0  | POS->VBZ if Pos:PRP@[-2,-1]\n",
      "  18  19   1   0  | NN->VB if Pos:-NONE-@[-2] & Pos:TO@[-1]\n",
      "  14  14   0   0  | VBP->VB if Pos:MD@[-2,-1]\n",
      "  12  12   0   0  | VBP->VB if Pos:TO@[-1]\n",
      "  11  11   0   0  | VBD->VBN if Pos:VBD@[-1]\n",
      "  11  11   0   0  | IN->WDT if Pos:-NONE-@[1] & Pos:VBP@[2]\n",
      "  10  11   1   0  | VBN->VBD if Pos:PRP@[-1]\n",
      "   9  10   1   0  | VBD->VBN if Pos:VBZ@[-1]\n",
      "   8   8   0   0  | NN->VB if Pos:MD@[-1]\n",
      "   7   7   0   1  | VB->NN if Pos:DT@[-1]\n",
      "   7   7   0   0  | VB->VBP if Pos:PRP@[-1]\n",
      "   7   7   0   0  | IN->WDT if Pos:-NONE-@[1] & Pos:VBZ@[2]\n",
      "   7   8   1   0  | IN->RB if Word:as@[2]\n",
      "   6   6   0   0  | VBD->VBN if Pos:VBP@[-2,-1]\n",
      "   6   6   0   1  | IN->WDT if Pos:-NONE-@[1] & Pos:VBD@[2]\n",
      "   5   5   0   0  | POS->VBZ if Pos:-NONE-@[-1]\n",
      "   5   5   0   0  | VB->VBP if Pos:NNS@[-1]\n",
      "   5   5   0   0  | VBD->VBN if Word:be@[-2,-1]\n",
      "   4   4   0   0  | POS->VBZ if Pos:``@[-2]\n",
      "   4   4   0   0  | VBP->VB if Pos:VBD@[-2,-1]\n",
      "   4   6   2   3  | RP->RB if Pos:CD@[1,2]\n",
      "   4   4   0   0  | RB->JJ if Pos:DT@[-1] & Pos:NN@[1]\n",
      "   4   4   0   0  | NN->VBP if Pos:NNS@[-2] & Pos:RB@[-1]\n",
      "   4   5   1   0  | VBN->VBD if Pos:NNP@[-2] & Pos:NNP@[-1]\n",
      "   4   4   0   0  | IN->WDT if Pos:-NONE-@[1] & Pos:MD@[2]\n",
      "   4   8   4   0  | VBD->VBN if Word:*@[1]\n",
      "   4   4   0   0  | JJS->RBS if Word:most@[0] & Word:the@[-1] & Pos:DT@[-1]\n",
      "   3   3   0   0  | VBD->VBN if Pos:VBN@[-1]\n",
      "   3   4   1   0  | VBN->VB if Pos:TO@[-1]\n",
      "   3   4   1   1  | IN->RB if Pos:.@[1]\n",
      "   3   3   0   0  | JJ->RB if Pos:VBD@[1]\n",
      "   3   3   0   0  | PRP$->PRP if Pos:TO@[1]\n",
      "   3   3   0   0  | NN->VBP if Pos:NNS@[-1] & Pos:DT@[1]\n",
      "   3   3   0   0  | VBP->VB if Word:n't@[-2,-1]\n",
      "Trained tbl tagger in 6.80 seconds\n",
      "    Accuracy on test set: 0.8572\n",
      "Tagging the test data\n"
     ]
    }
   ],
   "source": [
    "from nltk.tbl import demo as brill_demo\n",
    "\n",
    "brill_demo.demo()\n",
    "# print(open('errors.out').read())"
   ]
  },
  {
   "source": [
    "## 5.7 如何确定一个词的分类（词类标注）\n",
    "语言学家使用形态学、句法、语义来确定一个词的类别\n",
    "\n",
    "### 5.7.1 形态学线索：词的内部结构有助于词类标注。\n",
    "\n",
    "### 5.7.2 句法线索：词可能出现的典型的上下文语境。\n",
    "\n",
    "### 5.7.3 语义线索：词的意思\n",
    "\n",
    "### 5.7.4 新词（未知词）的标注：开放类和封闭类\n",
    "\n",
    "### 5.7.5 词性标记集中的形态学\n",
    "\n",
    "-   普通标记集捕捉的构词信息：词借助于句法角色获得的形态标记信息。\n",
    "-   大多数词性标注集都使用相同的基本类别。更精细的标记集中包含更多有关这些形式的信息。\n",
    "-   没有一个“正确的方式”来分配标记，只能根据目标不同而产生的或多或少有用的方法"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 5.8 小结\n",
    "-   词可以组成类，这些类称为词汇范畴或者词性。\n",
    "-   词性可以被分配短标签或者标记\n",
    "-   词性标注、POS标注或者标注：给文本中的词自动分配词性的过程\n",
    "-   语言词料库已经完成了词性标注\n",
    "-   标注器可以使用已经标注过的语料库进行训练和评估\n",
    "-   组合标注方法：把多种标注方法（默认标注器、正则表达式标注器、Unigram标注器、N-gram标注器）利用回退技术结合在一起使用\n",
    "-   回退是一个组合模型的方法：当一个较为专业的模型不能为给定内容分配标记时，可以回退到一个较为一般的模型\n",
    "-   词性标注是序列分类任务，通过利用局部上下文语境中的词和标记对序列中任意一点的分类决策\n",
    "-   字典用来映射任意类型之间的信息\n",
    "-   N-gram标注器可以定义为不同数值的n，当n过大时会面临数据稀疏问题，即使使用大量的训练数据，也只能够看到上下文中的一部分\n",
    "-   基于转换的标注包括学习一系列的“改变标记s为标记t在上下文c中”形式的修复规则，每个规则都可以修复错误，但是也可能会引入新的错误"
   ],
   "cell_type": "markdown",
   "metadata": {}
  }
 ]
}